This paper considers a proportional hazards model, which allows
one to examine the extent to which covariates interact nonlinearly
with an exposure variable, for analysis of lifetime data. A local
partial-likelihood technique is proposed to estimate nonlinear interac- 
tions. Asymptotic normality of the proposed estimator is established. 
The baseline hazard function, the bias and the variance of the local 
likelihood estimator are consistently estimated. In addition, a one- 
step local partial-likelihood estimator is presented to facilitate the 
computation of the proposed procedure and is demonstrated to be as 
efficient as the fully iterated local partial-likelihood estimator. Fur- 
thermore, a penalized local likelihood estimator is proposed to select 
important risk variables in the model. Numerical examples are used 
to illustrate the effectiveness of the proposed procedures. 

1. Introduction. One of the most celebrated models for analyzing life- 
time data is the Cox proportional hazards model, which explicitly postulates 
the covariate effects on the hazard risk via 

$\lambda(t) = \lambda_0(t) \exp\{g(Z)\},$

where $\lambda_0(\cdot)$ is the baseline hazard risk and $g(Z)$ reflects the covariate effect.
In parametric models it is commonly assumed that

$g(Z) = \beta^T Z$

for some unknown parameters $\beta$. See, for example, [1] and [20]. The log-linear
model is a simple and mathematically convenient model that provides
useful analysis for a covariate effect. However, in many biomedical studies,
the covariate effects can be more complicated than the log-linear effect and 
new analytic challenges arise in assessing nonlinear effects. Beyond the tra- 
ditional linear model, there are infinitely many possible nonlinear forms. 
Depending on the background of study, one often chooses a form that rea- 
sonably explains the objective of the study. For example, the effect of ex- 
posure variables and confounding factors on the hazard risk may vary with 
the level of an exposure variable, denoted by W. This leads one naturally 
to consider the model 

(1.1) $\lambda(t) = \lambda_0(t) \exp\{\beta(W(t))^T Z(t) + g(W(t))\}.$

Here $\beta(\cdot)$ and $g(\cdot)$ are unknown coefficient functions, characterizing the
extent to which the association varies with the level of the exposure variable
$W$. Note that the term $g(W(t))$ can be incorporated into the covariates $Z(t)$
by introducing a dummy variable with column one. We opt not to do so,
because the local intercept for $g(\cdot)$ will cancel out in the local partial
likelihood (2.3) below, leading to a different estimation rule for $g$. For ease of
presentation, we drop the dependence of covariates on time $t$, with the
understanding that the methods and proofs in this paper are applicable to
time-dependent covariates.

When the variable W is time, rather than a covariate variable, model 
(1.1) becomes a time-dependent coefficient Cox model, which has been stud- 
ied by a number of authors, including Zucker and Karr [37], Murphy and 
Sen [31], Gamerman [21], Murphy [30], Marzec and Marzec [28], Marti- 
nussen, Scheike and Skovgaard [27], Cai and Sun [10], and Tian, Zucker and 
Wei [32]. In this case, unless the coefficient functions $\beta(t)$ are independent
of time t, the model is no longer a proportional hazards model. In contrast, 
model (1.1) is still a proportional hazards model. It allows one to examine 
the extent to which covariates Z interact nonlinearly with the exposure vari- 
able W. As will be explained later, although model (1.1) looks similar to the 
time-dependent coefficient Cox model, it is more involved when establishing 
asymptotic properties. 

The varying-coefficient models arise from many different fields and have 
been studied in many different contexts. For cross-sectional type data, they 
have been studied as models to explore nonlinearity and assess nonlinear 
interactions by Cleveland, Grosse and Shyu [14], Hastie and Tibshirani [24], 
Carroll, Ruppert and Welsh [12], Fan and Zhang [19] and Cai, Fan and Li [8], 
among others. In time series, they are extensions of threshold autoregres- 
sive models and have been used to enhance the predictive power of linear 
autoregressive models. See, for example, [13] and [9]. The varying coefficient 
models have also been widely used to analyze longitudinal data. They allow 
one to examine the extent to which the association between independent 
and dependent variables varies over time. See, for example, [7, 25, 35, 36]. 

In this paper we propose techniques for estimating the coefficient functions
using local linear techniques [15]. The asymptotic bias and variance
are obtained by establishing asymptotic normality. The variance is then
estimated via a sandwich formula, which is shown to be consistent. To save
computation of the local partial-likelihood estimator, a one-step procedure
is proposed, which is shown to have the same asymptotic bias and variance 
as the local partial-likelihood estimator. Implementation of the proposed es- 
timator depends on the choice of good initial estimators: estimates at the 
nearest grid points are recommended. The resulting procedure is demon- 
strated to be quite effective in our numerical implementation. In addition, 
the baseline hazard function $\lambda_0(\cdot)$ is estimated via a kernel method. The
consistency property is demonstrated. 

An objective of survival analysis is to identify the risk factors and their 
risk contributions. At the initial stage of a study, many covariates are col- 
lected to reduce possible modeling biases, and a large model is built, namely 
the dimensionality of Z in (1.1) is high. An important and challenging task is 
to efficiently select a subset of significant variables from model (1.1). Fan and 
Li [17] proposed a family of new variable selection methods based on a non- 
concave penalized likelihood. Their methods are different from traditional 
ones in that they delete insignificant variables by estimating their coefficients 
as 0, and simultaneously select significant variables and estimate regression 
coefficients. Lasso, proposed by Tibshirani [33, 34], is a member of this family
with an $L_1$ penalty. From their simulations, Fan and Li [17] showed that
the penalized likelihood estimator with smoothly clipped absolute deviation
(SCAD) penalty outperforms the best subset variable selection in terms of
computational cost and stability in the terminology of Breiman [5]. In addition,
they have proven that SCAD improves the lasso in terms of estimation
biases. Furthermore, they have demonstrated that with a proper choice of 
regularization parameters and penalty functions (such as SCAD), the pe- 
nalized likelihood estimator possesses an oracle property. Namely, the true 
regression coefficients that are zero are automatically estimated as zero and 
the remaining coefficients are estimated as well as if the correct submodel is 
known in advance. Hence, the SCAD and its siblings are ideal for variable 
selection, at least from a theoretical point of view. These nice properties 
encouraged us to extend the technique to the nonparametric model (1.1). It 
gives us a quick and effective method for eliminating unimportant variables. 

The paper is organized as follows. Section 2 introduces the local partial- 
likelihood estimation and establishes the asymptotic normality. One-step 
estimation and estimation of the baseline hazard function are studied in 
Section 3. Section 4 deals with the issue of variable selection. Numerical ex- 
amples are given in Section 5. Technical proofs are relegated to Appendix A. 

2. Partial-likelihood estimation. Suppose that there is a random sample
of size $n$ from an underlying population. Let $T_i$ denote the potential failure
time, let $C_i$ denote the potential censoring time and let $X_i = \min(T_i, C_i)$
denote the observed time for the $i$th individual. Assume that $T_i$ and $C_i$
are independent given covariates $Z_i$ and $W_i$. Let $\Delta_i$ be an indicator which
equals 1 if $X_i$ is a failure time and 0 otherwise. The covariates $Z$ and $W$ are
allowed to be time dependent. The observed data structure is

$\{X_i, \Delta_i, Z_i, W_i\}$ for $i = 1, \ldots, n$,

where $Z_i = (Z_{i1}, \ldots, Z_{ip})^T$ and $W_i$ are two types of covariates, with $W$ being
an exposure variable of interest.

When all the observations are independent, the partial likelihood for
model (1.1) is

(2.1) $L(\beta, g) = \prod_{i=1}^{n} \left[ \frac{\exp\{\beta(W_i)^T Z_i + g(W_i)\}}{\sum_{j \in R(X_i)} \exp\{\beta(W_j)^T Z_j + g(W_j)\}} \right]^{\Delta_i},$

where $R(t) = \{i : X_i \ge t\}$ denotes the set of the individuals at risk just prior
to time $t$.

2.1. Local partial likelihood. If the unknown functions $\beta(\cdot)$ and $g(\cdot)$ are
parametrized, the parameters can be estimated by maximizing (2.1). For
our nonparametric estimation, since the forms of the unknown functions are
not available, we can only rely on their qualitative traits.

Assume that every component of $\beta(\cdot)$ and $g(\cdot)$ is smooth so that it admits
Taylor expansion: for each given $w_0$ and $w$ around $w_0$,

(2.2) $\beta(w) \approx \beta(w_0) + \beta'(w_0)(w - w_0) \equiv \delta + \eta(w - w_0),$
      $g(w) \approx g(w_0) + g'(w_0)(w - w_0) \equiv \alpha + \gamma(w - w_0).$

Substituting this into (2.1), we obtain the logarithm of the local partial
likelihood,

(2.3) $\ell(\delta, \eta, \gamma) = n^{-1} \sum_{i=1}^{n} K_h(W_i - w_0) \Delta_i \Bigl[ \delta^T Z_i + \eta^T Z_i (W_i - w_0) + \gamma (W_i - w_0) - \log \Bigl( \sum_{j \in R(X_i)} \exp\{\delta^T Z_j + \eta^T Z_j (W_j - w_0) + \gamma (W_j - w_0)\} K_h(W_j - w_0) \Bigr) \Bigr],$

where $K$ is a probability density called a kernel function, $h$ represents the
size of the local neighborhood and $K_h(\cdot) = K(\cdot/h)/h$. The kernel weight is
introduced to ensure that the local model (2.2) is only applied to the data
around $w_0$. The local partial likelihood (2.3) can be derived from a profile
likelihood point of view. The derivation is similar to those of Breslow [6] and
Fan, Gijbels and King [16].

Let $\hat\gamma(w_0)$, $\hat\delta(w_0)$ and $\hat\eta(w_0)$ be the maximizer of (2.3). Then $\hat\beta(w_0) =
\hat\delta(w_0)$ is a local linear estimator for the coefficient function $\beta(\cdot)$ at the point
$w_0$. Similarly, an estimator of $g'(\cdot)$ at the point $w_0$ is simply the local slope
$\hat\gamma(w_0)$, namely $\hat g'(w_0) = \hat\gamma(w_0)$. The curve $\hat g$ can be estimated by integration
on the function $\hat g'(w_0)$. Following Hastie and Tibshirani [23], the integration
can be approximated by using the trapezoidal rule.
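
The integration step can be sketched as follows. This is only an illustration, assuming that slope estimates $\hat g'(w_k)$ are already available on an increasing grid and that the curve is anchored by a starting value (here $g = 0$ at the first grid point, an assumption made for the example); the function name `integrate_slopes` is hypothetical.

```python
def integrate_slopes(grid, slopes, g_start=0.0):
    """Cumulative trapezoidal rule: rebuild g on `grid` from estimates of g'.

    grid    -- increasing grid points w_1 < ... < w_m
    slopes  -- estimated derivative g'(w_k) at each grid point
    g_start -- assumed value of g at the first grid point
    """
    if len(grid) != len(slopes):
        raise ValueError("grid and slopes must have the same length")
    g = [g_start]
    for k in range(1, len(grid)):
        step = grid[k] - grid[k - 1]
        # Area of the trapezoid between consecutive grid points.
        g.append(g[-1] + 0.5 * step * (slopes[k] + slopes[k - 1]))
    return g


# Illustration with a known curve: g(w) = w**2, so g'(w) = 2w.
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
slopes = [2.0 * w for w in grid]
g_hat = integrate_slopes(grid, slopes)
```

Because the trapezoidal rule is exact for a linear integrand, this toy case recovers $w^2$ on the grid; with estimated slopes the result carries both the estimation error and the quadrature error.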

We now express the local partial likelihood using the counting process
notation. To this end, let $N_i(t) = I(T_i \le t, \Delta_i = 1)$ and $Y_i(t) = I(X_i \ge t)$. Set

$\xi = (\delta^T, \eta^T, \gamma)^T$ and $X_i^* = (Z_i^T, Z_i^T (W_i - w_0), W_i - w_0)^T$.

Then the local partial-likelihood function (2.3) can be expressed as

(2.4) $\ell_n(\xi, \tau) = n^{-1} \sum_{i=1}^{n} \int_0^{\tau} K_h(W_i - w_0) \xi^T X_i^* \, dN_i(u) - n^{-1} \sum_{i=1}^{n} \int_0^{\tau} K_h(W_i - w_0) \log\Bigl\{ \sum_{j=1}^{n} Y_j(u) \exp(\xi^T X_j^*) K_h(W_j - w_0) \Bigr\} dN_i(u)$

with $\tau = \infty$. To avoid the technicality of tail problems, only the data up
to a finite time point $\tau$ are frequently used. Without ambiguity, we will let
$\hat\xi(w_0)$ be the maximizer of (2.4).

Note that the local partial likelihood in (2.4) is more complicated than 
that for the time-dependent coefficient Cox model. In particular, the kernel 
functions appear twice in the local partial likelihood (2.4), so as to use only 
local data. In contrast, for the time-dependent coefficient model, localizing in 
time once suffices. As a consequence, the technical proofs are more involved 
in the current setting. 
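
To make the double use of the kernel concrete, the following sketch evaluates the local log partial likelihood at a point $w_0$. It is a minimal illustration under stated assumptions, not the authors' implementation: covariates are taken to be time-independent, $\tau = \infty$, the kernel is Gaussian, and the names `gauss_kernel` and `local_partial_loglik` as well as the toy data are hypothetical.

```python
import math


def gauss_kernel(x, h):
    """Rescaled Gaussian kernel K_h(x) = K(x / h) / h."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def local_partial_loglik(delta, eta, gamma, w0, h, times, events, Z, W):
    """Local log partial likelihood at w0 (time-independent covariates assumed).

    delta, eta -- lists of length p (local level and slope of beta)
    gamma      -- float, local slope of g
    times      -- observed times X_i; events -- 1 for failure, 0 for censoring
    Z          -- list of length-p covariate lists; W -- exposure values
    """
    n = len(times)
    p = len(delta)
    kern = [gauss_kernel(W[i] - w0, h) for i in range(n)]
    # Local linear predictor xi^T X_i^* for every subject.
    lin = []
    for i in range(n):
        d = W[i] - w0
        lin.append(sum((delta[k] + eta[k] * d) * Z[i][k] for k in range(p)) + gamma * d)
    total = 0.0
    for i in range(n):
        if events[i] != 1 or kern[i] == 0.0:
            continue  # censored subjects and zero-weight subjects contribute nothing
        # The kernel weight enters a second time, inside the risk-set sum.
        risk = sum(kern[j] * math.exp(lin[j]) for j in range(n) if times[j] >= times[i])
        total += kern[i] * (lin[i] - math.log(risk))
    return total / n


# Toy data (purely illustrative).
times = [0.5, 1.2, 0.8, 2.0]
events = [1, 1, 0, 1]
Z = [[1.0], [0.0], [1.0], [0.5]]
W = [0.2, 1.1, 0.7, 1.6]
value = local_partial_loglik([0.3], [0.0], 0.1, 1.0, 0.5, times, events, Z, W)
```

Subjects far from $w_0$ are down-weighted both as failures (outer weight) and as members of the risk set (inner weight), which is the feature that distinguishes this criterion from localizing in time.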

The above method uses only one smoothing parameter to fit all the coefficient
functions. When the coefficient functions admit different degrees of
smoothness [e.g., g'(w) often admits a different degree of smoothness from 
other coefficient functions], one needs to use different bandwidths for dif- 
ferent components. The two-step estimation method of Fan and Zhang [19] 
can be adapted here. 

2.2. Asymptotic normality. We now establish the asymptotic normality
of the local partial-likelihood estimator. As shown in Appendix A, the local
partial-likelihood function $\ell_n(\xi, \tau)$ is concave in $\xi$ and its maximizer exists
with probability tending to 1. Let $H$ be a $(2p+1) \times (2p+1)$ diagonal matrix,
with the first $p$ elements 1 and the remaining $p + 1$ elements $h$, where $p$
is the number of elements in $Z$. For any function $\xi(w)$, $w \in J$, let
$\|\xi\|_J = \sup_{w \in J} |\xi(w)|$; for a $p$-vector $a$, let $|a| = (\sum_{j=1}^{p} a_j^2)^{1/2}$ and $\|a\| = \sup_j |a_j|$; and
for a matrix $A$, let $\|A\| = \sup_{i,j} |a_{ij}|$. Then we have the following consistency
result.



Theorem 1. Under Conditions A.1-A.8 in Appendix A, we have

$H\{\hat\xi(w_0) - \xi_0(w_0)\} \xrightarrow{P} 0,$

where $\xi_0(w_0) = (\beta_0(w_0)^T, \beta_0'(w_0)^T, g_0'(w_0))^T$ is the vector of the true parameter
functions. If, in addition, Conditions B.1-B.8 hold, then we have the
uniform consistency

$\|H\{\hat\xi - \xi_0\}\|_{J_W} = \sup_{w \in J_W} |H\{\hat\xi(w) - \xi_0(w)\}| \xrightarrow{P} 0,$

where $J_W$ is a compact subset of the support of the random variable $W$.



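To express explicitly the bias and variance of the estimator, we introduce
some necessary notation. Let

$\mu_i = \int x^i K(x)\,dx, \qquad \nu_i = \int x^i K^2(x)\,dx.$

Denote

$P(u, z, w_0) = P(X \ge u \mid Z = z, W = w_0)$ and

$\rho(u, z, w_0) = P(u, z, w_0) \exp\{\beta_0(w_0)^T z + g_0(w_0)\}.$

For $k = 0, 1, 2$, define

$a_k(u, w_0) = f(w_0) E\{\rho(u, Z, w_0) Z^{\otimes k} \mid W = w_0\},$

where $f(\cdot)$ is the density of $W$ and $Z^{\otimes k} = 1$, $Z$ and $ZZ^T$ for $k = 0, 1$ and 2,
respectively. Additionally set

$\mathbf{a}_k = \mathbf{a}_k(w_0) = \int_0^{\tau} a_k(u, w_0)\,d\Lambda_0(u).$

We will drop the dependence of $a_k(u, w_0)$ and $\mathbf{a}_k(w_0)$ on $w_0$ whenever there
is no ambiguity. Finally, let

$\Gamma = \Gamma(w_0) = \int_0^{\tau} \{a_2(u) - a_1(u) a_1(u)^T a_0^{-1}(u)\} \lambda_0(u)\,du$

and

$Q = \begin{pmatrix} (\mathbf{a}_2 - \mathbf{a}_1 \mathbf{a}_1^T \mathbf{a}_0^{-1})^{-1} & -(\mathbf{a}_2 - \mathbf{a}_1 \mathbf{a}_1^T \mathbf{a}_0^{-1})^{-1} \mathbf{a}_1 \mathbf{a}_0^{-1} \\ -\mathbf{a}_0^{-1} \mathbf{a}_1^T (\mathbf{a}_2 - \mathbf{a}_1 \mathbf{a}_1^T \mathbf{a}_0^{-1})^{-1} & (\mathbf{a}_0 - \mathbf{a}_1^T \mathbf{a}_2^{-1} \mathbf{a}_1)^{-1} \end{pmatrix},$

where, in fact, $\mathbf{a}_0$ is a scalar.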



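Theorem 2 (Asymptotic normality). Suppose that Conditions A.1-A.8
in Appendix A hold. Then

$\sqrt{nh}\,\{H(\hat\xi(w_0) - \xi_0(w_0)) - \tfrac{1}{2} h^2 \mu_2 e_p \xi_0''(w_0)\} \xrightarrow{D} N(0, \Sigma(\tau, w_0)),$

where $e_p$ is a $(2p+1)$-order diagonal matrix, with the first $p$ elements 1 and
the last $p + 1$ diagonal elements 0, $\xi_0''(w_0) = (\beta_0''(w_0)^T, \beta_0'''(w_0)^T, g_0'''(w_0))^T$ and

$\Sigma(\tau, w_0) = \begin{pmatrix} \Gamma^{-1} \nu_0 & 0 \\ 0 & Q \mu_2^{-2} \nu_2 \end{pmatrix}.$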

The above theorem gives the joint asymptotic normality for the local 
partial-likelihood estimator. Its marginal distribution can easily be obtained 
as in the following corollary. 

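Corollary 1. Under the conditions of Theorem 2, we have

$\sqrt{nh}\,\{\hat\beta(w_0) - \beta_0(w_0) - h^2 \beta_0''(w_0) \mu_2 / 2\} \xrightarrow{D} N(0, \nu_0 \Gamma^{-1}),$

$\sqrt{nh^3}\,\{\hat g'(w_0) - g_0'(w_0)\} \xrightarrow{D} N(0, (\mathbf{a}_0 - \mathbf{a}_1^T \mathbf{a}_2^{-1} \mathbf{a}_1)^{-1} \mu_2^{-2} \nu_2).$

Furthermore, they are asymptotically independent.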

As a consequence of Theorem 2, the theoretical optimal bandwidth can 
be obtained. 



3. Issues related to partial-likelihood estimation. In this section we discuss
a few issues that are related to the implementation of the partial-likelihood
estimator.



3.1. One-step local partial-likelihood estimator. When estimating the
whole functions $\beta(\cdot)$ and $g(\cdot)$, we usually need to apply the local partial
likelihood (2.4) at hundreds of points. Computing such an implicit estimator
requires an iterative algorithm such as the Newton-Raphson method or
Fisher's scoring method. Even worse, for certain given $w_0$, there does not
exist a local partial-likelihood estimator due to the limited amount of data
around $w_0$. These drawbacks make the local partial-likelihood estimator less
appealing. Following Fan and Chen [15], we propose a one-step estimator as
a viable alternative.

The local partial-likelihood estimator $\hat\xi$ is found via solving the likelihood
equation $\ell_n'(\hat\xi, \tau) = 0$, where $\ell_n'(\xi, \tau) = \partial \ell_n(\xi, \tau) / \partial \xi$. To facilitate notation,
from now on we drop the dependence of $\ell_n(\xi, \tau)$ on $\tau$. For a given initial
estimator $\hat\xi_0$, by Taylor expansion we have

$\ell_n'(\hat\xi_0) + \ell_n''(\hat\xi_0)(\hat\xi - \hat\xi_0) \approx 0.$

Thus, the one-step estimator $\hat\xi_{os}$ is defined as

(3.1) $\hat\xi_{os} = \hat\xi_0 - \{\ell_n''(\hat\xi_0)\}^{-1} \ell_n'(\hat\xi_0).$

A natural question arises: How good an initial estimator $\hat\xi_0$ is needed for
the one-step estimator to have the same performance as the maximum local
partial-likelihood estimator? The following theorem gives an answer to this
question.

Theorem 3. Under the conditions given in Theorem 2, $\hat\xi_{os}$ has the same
asymptotic distribution as the maximum local partial-likelihood estimator
provided that

(3.2) $H(\hat\xi_0 - \xi_0) = O_P(h^2 + (nh)^{-1/2}).$
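
The one-step update (3.1) can be illustrated in the simplest special case of model (1.1) with no covariates $Z$, so that the local parameter reduces to the scalar slope $\gamma$. The sketch below is a hedged illustration only: it assumes time-independent $W$, a Gaussian kernel and $\tau = \infty$, and the names `loglik_derivs` and `one_step` and the toy data are hypothetical.

```python
import math


def gauss_kernel(x, h):
    """Rescaled Gaussian kernel K_h(x) = K(x / h) / h."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def loglik_derivs(gamma, w0, h, times, events, W):
    """Return (l_n, l_n', l_n'') of the local partial likelihood when p = 0."""
    n = len(times)
    v = [W[i] - w0 for i in range(n)]
    kern = [gauss_kernel(v[i], h) for i in range(n)]
    value = score = hess = 0.0
    for i in range(n):
        if events[i] != 1:
            continue
        risk = [j for j in range(n) if times[j] >= times[i]]
        wts = [kern[j] * math.exp(gamma * v[j]) for j in risk]
        s0 = sum(wts)
        s1 = sum(wt * v[j] for wt, j in zip(wts, risk))
        s2 = sum(wt * v[j] ** 2 for wt, j in zip(wts, risk))
        mean = s1 / s0
        value += kern[i] * (gamma * v[i] - math.log(s0))
        score += kern[i] * (v[i] - mean)            # derivative of the summand
        hess -= kern[i] * (s2 / s0 - mean ** 2)     # minus a weighted variance
    return value / n, score / n, hess / n


def one_step(gamma_init, w0, h, times, events, W):
    """One Newton-Raphson step from an initial value, as in (3.1)."""
    _, score, hess = loglik_derivs(gamma_init, w0, h, times, events, W)
    return gamma_init - score / hess


# Toy data (purely illustrative).
times = [0.5, 1.2, 0.8, 2.0, 1.5, 0.3]
events = [1, 1, 0, 1, 1, 1]
W = [0.2, 1.1, 0.7, 1.6, 0.9, 1.3]
gamma_os = one_step(0.0, 1.0, 0.5, times, events, W)
```

The second derivative is minus a kernel-weighted variance, which is why the criterion is concave; the quality of the single step then depends on how close the initial value is, which is the content of condition (3.2).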

Theorem 3 provides the conditions under which the one-step estimator 
performs as well as the local partial-likelihood estimator. However, it does 
not provide any guidance for choosing an initial estimator. Cai, Fan and 
Li [8] provided a useful strategy for the choice of initial estimators and 
their idea can be adapted to the current setting. The basic idea is first to 
compute the local partial-likelihood estimates at a few fixed points. Use these 
estimates as the initial values of their nearest grid points and obtain the
one-step estimates at these grid points. For example, in our simulation studies
we evaluate the functions at $n_{\mathrm{grid}} = 200$ grid points. We first compute the
maximum local pseudo-partial-likelihood estimators at specific grid points
$w_{20}, w_{60}, w_{100}, w_{140}$ and $w_{180}$, and then use them as the initial values for the
one-step estimator at their nearest grid points. Use the newly computed
one-step estimates (at points $w_{19}, w_{21}, w_{59}, w_{61}, \ldots$) as the initial values of
their nearest grid points to compute the one-step estimates and so on, until
the one-step estimates at all grid points are computed. Hence, as long as the
number of grid points is large enough, condition (3.2) holds.
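
The order in which initial values spread over the grid can be sketched as a breadth-first sweep outward from the anchor points. This is an illustration of the bookkeeping only (no estimation is performed); the function name `propagation_plan` is hypothetical, and ties between two equally distant anchors are broken arbitrarily by processing order, which is an assumption of the sketch.

```python
def propagation_plan(n_grid, anchors):
    """Plan the spread of initial values over grid indices 1..n_grid.

    Returns (order, source): `order` lists grid indices in the order they are
    processed, and `source[k]` is the neighbouring index whose estimate serves
    as the initial value at k (None for anchors, which are fully iterated).
    """
    source = {a: None for a in anchors}
    order = list(anchors)
    frontier = list(anchors)
    while frontier:
        next_frontier = []
        for k in frontier:
            for nb in (k - 1, k + 1):
                if 1 <= nb <= n_grid and nb not in source:
                    source[nb] = k          # one-step estimate at nb starts from k
                    order.append(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return order, source


order, source = propagation_plan(200, [20, 60, 100, 140, 180])
```

With these anchors the first sweep reaches indices 19, 21, 59, 61 and so on, and every grid point is at most 20 steps away from a fully iterated estimate.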

3.2. Estimation of baseline hazard function. With estimators of $\beta(\cdot)$ and
$g(\cdot)$, we can estimate the baseline hazard function by using a kernel smoothing,

$\hat\lambda_0(t) = \int W_b(t - x)\,d\hat\Lambda_0(x),$

Note that $\hat\Lambda_0(\cdot)$ is an estimate of the cumulative hazard function $\Lambda_0$.

Theorem 4. Under Condition B in Appendix A, we have
$\hat\lambda_0(t) \to \lambda_0(t)$ and $\hat\Lambda_0(t) \to \Lambda_0(t)$
uniformly on $(0, \tau]$ in probability.

3.3. Estimation of biases and variances. The biases of nonparametric 
estimates are generally hard to estimate, since they involve higher-order 
derivatives. However, their variances can be estimated quite reasonably. 
Thus, in construction of confidence intervals/bands, the bias components 
are frequently omitted; in particular, undersmoothing procedures have been 
used to make the biases negligible relative to their standard error. See, for 
example, [4, 22, 26]. Some people might argue that this is also the approach 
that parametric methods take — modeling biases are inevitable and they are 
simply ignored in the construction of the parametric confidence intervals. 

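The bias and covariance of these local estimators $H(\hat\xi(w_0) - \xi_0(w_0))$ can
be estimated by

$\hat A_n^{-1}(\tau, w_0) \hat B_n(\tau, w_0)$ and $(nh)^{-1} \hat A_n^{-1}(\tau, w_0) \hat\Pi_n(\tau, w_0) \hat A_n^{-1}(\tau, w_0),$

where

$\hat A_n(\tau, w_0) = \frac{1}{n} \sum_{i=1}^{n} \int_0^{\tau} K_h(W_i - w_0) \frac{S_{n2}(u, w_0) S_{n0}(u, w_0) - S_{n1}(u, w_0) S_{n1}(u, w_0)^T}{(S_{n0}(u, w_0))^2}\,dN_i(u),$

$\hat B_n(\tau, w_0) = \frac{1}{n} \sum_{i=1}^{n} \int_0^{\tau} K_h(W_i - w_0) \Bigl( U_i^*(u) - \frac{S_{n1}(u, w_0)}{S_{n0}(u, w_0)} \Bigr) Y_i(u) \hat\lambda_i(u)\,du,$

$\hat\Pi_n(\tau, w_0) = \frac{h}{n} \sum_{i=1}^{n} \int_0^{\tau} K_h^2(W_i - w_0) \Bigl( U_i^*(u) - \frac{S_{n1}(u, w_0)}{S_{n0}(u, w_0)} \Bigr)^{\otimes 2} Y_i(u) \hat\lambda_i(u)\,du,$

with $\hat\lambda_i(u) = \exp(\hat\beta(W_i)^T Z_i(u) + \hat g(W_i)) \hat\lambda_0(u)$, $U_i^* = H^{-1} X_i^*$ and

$S_{nk}(u, w_0) = n^{-1} \sum_{i=1}^{n} K_h(W_i - w_0) Y_i(u) \exp(\hat\xi(w_0)^T X_i^*(u)) (U_i^*(u))^{\otimes k}, \qquad k = 0, 1, 2.$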

Theorem 5. Under the conditions of Theorem 4, we have

$h^{-2} \hat A_n^{-1}(\tau, w_0) \hat B_n(\tau, w_0) \to e_p \xi_0''(w_0) \mu_2 / 2,$

$\hat A_n^{-1}(\tau, w_0) \hat\Pi_n(\tau, w_0) \hat A_n^{-1}(\tau, w_0) \to \Sigma(\tau, w_0)$

in probability.

In fact, by using the martingale properties, we can construct different
estimators of $B_n(\tau, w_0)$ and $\Pi_n(\tau, w_0)$ without estimating the baseline hazard
function $\lambda_0(\cdot)$. That is,

$\tilde B_n(\tau, w_0) = \frac{1}{n} \sum_{i=1}^{n} \int_0^{\tau} K_h(W_i - w_0) \Bigl( U_i^*(u) - \frac{S_{n1}(u, w_0)}{S_{n0}(u, w_0)} \Bigr) dN_i(u),$

$\tilde\Pi_n(\tau, w_0) = \frac{h}{n} \sum_{i=1}^{n} \int_0^{\tau} K_h^2(W_i - w_0) \Bigl( U_i^*(u) - \frac{S_{n1}(u, w_0)}{S_{n0}(u, w_0)} \Bigr)^{\otimes 2} dN_i(u).$

The results of Theorem 5 still hold when the quantities $\hat B_n(\tau, w_0)$ and
$\hat\Pi_n(\tau, w_0)$ are replaced by $\tilde B_n(\tau, w_0)$ and $\tilde\Pi_n(\tau, w_0)$, respectively.

One can also use the bootstrap method as in [32] to obtain an estimated 
variance for our estimators. In fact, the method is particularly useful for 
estimating the sampling variability of g(w), since its analytic form is hard 
to derive. 

4. Variable selection via nonconcave penalized likelihood. 

4.1. Local penalized likelihood. For the nonparametric model (1.1), it is 
not easy to give a variable selection procedure without going to detailed 
inferences on each coefficient function. Motivated by the work of Fan and 
Li [17, 18], we apply their procedure locally around each grid point $w_0$. This
results in the penalized log partial-likelihood function

(4.1) $Q(\xi) = \ell_n(\xi, \tau) - \sum_{j=1}^{2p+1} p_{\varrho}(|\xi_j|),$

where $p_{\varrho}(\cdot)$ is a penalty function with regularization parameter $\varrho$. The
penalized local partial-likelihood estimate of $\xi$ is the maximizer of (4.1). With a
proper choice of $\varrho$ and a penalty function, many estimated coefficients will be
zero and hence their corresponding
variables do not appear in the model at the point $w_0$. This achieves the
objective of variable selection and results in a simple and implementable 
method to begin with. 

A good penalty function should result in an estimator with the follow- 
ing three properties: unbiasedness for large coefficients to attenuate biases, 
sparsity (many small coefficients are estimated as zero) to reduce model 
complexity and continuity to avoid unnecessary variation in model prediction.
Necessary conditions for unbiasedness, sparsity and continuity have
been derived by Antoniadis and Fan [3] and Fan and Li [17]. A simple
penalty function that satisfies all the three mathematical requirements is
the smoothly clipped absolute deviation (SCAD) penalty, defined by

(4.2) $p_{\varrho}'(\theta) = \varrho \Bigl\{ I(\theta \le \varrho) + \frac{(a\varrho - \theta)_+}{(a - 1)\varrho} I(\theta > \varrho) \Bigr\}$

for some $a > 2$ and $\theta > 0$.

Fan and Li [17] suggested using a = 3.7 from a Bayesian point of view and 
this value will be used in our numerical implementation. 
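
As a small illustration of (4.2), the derivative of the SCAD penalty can be coded directly. The function name `scad_derivative` and the argument name `reg` (standing for the regularization parameter) are hypothetical choices made for this sketch; the default $a = 3.7$ follows the value quoted above.

```python
def scad_derivative(theta, reg, a=3.7):
    """First derivative of the SCAD penalty at theta >= 0.

    reg is the regularization parameter (> 0) and a > 2 the shape constant.
    """
    if theta < 0 or reg <= 0 or a <= 2:
        raise ValueError("need theta >= 0, reg > 0 and a > 2")
    if theta <= reg:
        return reg                          # behaves like the L1 penalty near zero
    # reg * (a*reg - theta)_+ / ((a - 1) * reg) simplifies to the line below.
    return max(a * reg - theta, 0.0) / (a - 1.0)


# Derivative on a few points with reg = 1: constant, then decaying, then zero.
values = [scad_derivative(t, 1.0) for t in (0.5, 1.0, 2.0, 3.7, 5.0)]
```

The derivative equals the regularization parameter for small arguments (which produces sparsity), decreases linearly on the middle range and vanishes beyond $a$ times the regularization parameter, so large coefficients are not shrunk.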

There are two issues related to the practical implementation of the proce- 
dure. First, to facilitate the implementation we use only one regularization 
parameter for all variables which can have very different scales. Thus, we 
need to standardize variables before using (4.1). Since each variable in (4.1) 
is used locally around a given point $w_0$, its sample mean and standard deviation
should be defined locally. For example, the variable $Z_1$ at the point
$w_0$ can be standardized by using the local sample mean

$\mathrm{ave}(Z_1 \mid w_0) = \frac{1}{N} \sum_{i=1}^{n} K_h(W_i - w_0) Z_{i1}$

and the local sample variance

$\mathrm{var}(Z_1 \mid w_0) = \frac{1}{N} \sum_{i=1}^{n} K_h(W_i - w_0) Z_{i1}^2 - \mathrm{ave}(Z_1 \mid w_0)^2,$

where $N = \sum_{i=1}^{n} K_h(W_i - w_0)$. The second issue is that the number of variables
as a function of $w_0$, if not constant, will be discontinuous. This will
lead to discontinuous estimates of coefficient functions. This may not be bad 
in terms of overall prediction error, but does not produce parsimonious and 
appealing models. To avoid this, we use a simple voting rule: if a coefficient 
function is estimated as zero over a certain percentage of grid points, delete 
its corresponding variable; otherwise keep the variable. In our implementa- 
tion, we use the majority voting rule, namely, the thresholding percentage 
is taken as 50%. 
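
Both implementation devices can be sketched in a few lines. The code below is only an illustration under stated assumptions: a Gaussian kernel is used, a variable is dropped when the share of grid points with a zero estimate strictly exceeds the threshold (the treatment of an exact tie is an assumption), and the names `local_mean_var` and `keep_variable` are hypothetical.

```python
import math


def gauss_kernel(x, h):
    """Rescaled Gaussian kernel K_h(x) = K(x / h) / h."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def local_mean_var(z, W, w0, h):
    """Kernel-weighted mean and variance of a covariate around w0."""
    weights = [gauss_kernel(w - w0, h) for w in W]
    total = sum(weights)                                   # N in the text
    mean = sum(k * zi for k, zi in zip(weights, z)) / total
    var = sum(k * zi * zi for k, zi in zip(weights, z)) / total - mean ** 2
    return mean, var


def keep_variable(coef_on_grid, threshold=0.5):
    """Voting rule: keep the variable unless it is zero at too many grid points."""
    zero_share = sum(1 for c in coef_on_grid if c == 0.0) / len(coef_on_grid)
    return zero_share <= threshold


# Toy illustration.
z1 = [1.0, 2.0, 3.0, 4.0]
W = [0.1, 0.4, 0.9, 1.5]
mean, var = local_mean_var(z1, W, w0=0.5, h=0.3)
z1_std = [(zi - mean) / math.sqrt(var) for zi in z1]
selected = keep_variable([0.0, 0.0, 0.0, 1.2])   # zero at 75% of the grid points
```

Standardizing with local rather than global moments keeps a single regularization parameter meaningful at every grid point, while the voting rule turns pointwise selections into one decision per variable.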

4.2. Oracle property. We now establish an oracle property of the penal- 
ized local partial-likelihood estimator. We assume without loss of generality 
that the first s variables of Z are significant and the last p — s variables 
are not significant. To state our main result more explicitly, we need the 
following notation. 

Recall that $\xi = (\delta^T, \eta^T, \gamma)^T$. We divide $\delta$ into $(\delta_1^T, \delta_2^T)^T$, where $\delta_1$ and
$\delta_2$ are $s \times 1$ and $(p - s) \times 1$ vectors, representing, respectively, the nonvanishing
and vanishing coefficients. Corresponding to the partition of $\delta$, we divide
$\eta$ into $(\eta_1^T, \eta_2^T)^T$. Write

$\xi_1 = (\delta_1^T, \eta_1^T, \gamma)^T = (\xi_{1,1}, \ldots, \xi_{1,2s}, \xi_{1,2s+1})^T$

and $\xi_2 = (\delta_2^T, \eta_2^T)^T$. Let $\xi_{10} = (\xi_{1,1,0}, \ldots, \xi_{1,2s+1,0})^T$, and $\xi_{20}$ and $\xi_0$ be,
respectively, the true values of $\xi_2$ and $\xi$. For example, $\xi_{1,j,0} = \beta_{j0}(w_0)$
for $j = 1, \ldots, s$, $\xi_{1,j,0} = \beta_{j-s,0}'(w_0)$ for $j = s + 1, \ldots, 2s$ and $\xi_{1,2s+1,0} = g_0'(w_0)$.
Without loss of generality, assume that $\xi_{20} = 0$. Set

$a_n(w_0) = \max\{p_{\varrho}'(|\xi_{1,j,0}|) : \xi_{1,j,0} \ne 0\},$

$b_n(w_0) = \max\{|p_{\varrho}''(|\xi_{1,j,0}|)| : \xi_{1,j,0} \ne 0\}.$

Let $\Pi_1$ and $A_1$ be, respectively, the submatrices of $\Pi(\tau, w_0)$ and $A(\tau, w_0)$
in (A.10) and (A.16) in Appendix A that correspond to the rows in $\xi_1$.
Corresponding to the partition of $\delta$, let $\Gamma^{-1} = (\Gamma_{-1}^T, \Gamma_{-2}^T)^T$ with $\Gamma_{-1}$ and
$\Gamma_{-2}$ being $s \times p$ and $(p - s) \times p$ matrices, respectively.

The following theorem shows how the rates of convergence for the penal- 
ized local partial-likelihood estimates depend on the regularization parame- 
ter. 

Theorem 6. Suppose that Conditions A.1-A.8 in Appendix A hold.
If $b_n(w_0) \to 0$, then there exists a local maximizer $\hat\xi_p$ of $Q(\xi)$ such that

$\|\hat\xi_p - \xi_0\| = O_P(h^2 + (nh)^{-1/2} + a_n(w_0)).$

It is clear from Theorem 6 that by choosing a proper $\varrho$, such that
$a_n(w_0) = O((nh)^{-1/2} + h^2)$, there exists a $\{(nh)^{-1/2} + h^2\}$-consistent penalized local
partial-likelihood estimator. Now we show that this estimator must possess
an oracle property.

Theorem 7. Assume that the penalty function $p_{\varrho}(\theta)$ satisfies

(4.3) $\liminf_{n \to \infty} \liminf_{\theta \to 0+} p_{\varrho}'(\theta)/\varrho > 0.$

Let $\varrho \to 0$, $\{(nh)^{-1/2} + h^2\}/\varrho \to 0$ and $a_n(w_0) = O\{(nh)^{-1/2} + h^2\}$. Under the
conditions of Theorem 6, the consistent local maximizer $\hat\xi_p = (\hat\xi_{1p}^T, \hat\xi_{2p}^T)^T$ in
Theorem 6 satisfies the following statements with probability tending to 1:

(a) (Sparsity) We have $\hat\xi_{2p} = 0$.

(b) (Asymptotic normality) We have



'nM^i^-^) 

(4.4) 



Bf 1 H{ 1 h + h 2 p 2 



s+ i 



^(0,ni(r,w„)), 
where $b = (p_{\varrho}'(|\xi_{1,1,0}|)\,\mathrm{sgn}(\xi_{1,1,0}), \ldots, p_{\varrho}'(|\xi_{1,2s+1,0}|)\,\mathrm{sgn}(\xi_{1,2s+1,0}))^T$,
$B_1 = A_1 - H_1^{-1} \Sigma_1 H_1^{-1}$, $\Sigma_1 = \mathrm{diag}\{p_{\varrho}''(|\xi_{1,1,0}|), \ldots, p_{\varrho}''(|\xi_{1,2s+1,0}|)\}$,
$\beta_0''(w_0) = (\beta_{10}''(w_0), \beta_{20}''(w_0), \ldots, \beta_{s0}''(w_0), 0)^T$, and $H_1$ is a $(2s+1) \times (2s+1)$ diagonal
matrix with first $s$ elements 1 and the last $s + 1$ elements $h$.

We now explain that the penalized local-likelihood estimators possess an
oracle property when penalty functions are properly chosen. Suppose that
there is an oracle who knows $\xi_{20} = 0$. She then uses this knowledge to
estimate $\xi_{10}$, resulting in an oracle estimator. From Theorem 2, the asymptotic
covariance matrix of this oracle estimator is $\frac{1}{nh} A_1^{-1} \Pi_1(\tau, w_0) A_1^{-1}$. For
penalty functions such as SCAD, since $\varrho \to 0$, for sufficiently large $n$,

$a_n(w_0) = 0$ and $b_n(w_0) = 0$, so $b = 0$ and $\Sigma_1 = 0$.

Thus, Theorems 6 and 7 yield that $\hat\xi_{2p} = 0$ and $H_1(\hat\xi_{1p} - \xi_{10})$ is asymptotically
normal with covariance matrix $\frac{1}{nh} A_1^{-1} \Pi_1(\tau, w_0) A_1^{-1}$, which is the same
as the asymptotic variance of the oracle estimator (see Theorem 2). Furthermore,
it can easily be seen that both estimators share the same asymptotic
bias. Thus, the penalized likelihood estimators perform as well as the oracle
estimator when the penalty functions are constant at the tails. In other
words, when the true parameters have some zero components, they are
estimated as 0 with probability tending to 1 and the nonzero components are
estimated as well as the case where the correct submodel is known.

5. Numerical examples. 

5.1. Simulations. In this section we first compare the performance of the 
one-step and local partial-likelihood estimators. The performance of the estimator
$\hat\beta(\cdot)$ is assessed via the weighted mean square error (WMSE),

(5.1) $\mathrm{WMSE} = \frac{1}{n_{\mathrm{grid}}} \sum_{j=1}^{p} \sum_{k=1}^{n_{\mathrm{grid}}} a_j [\hat\beta_j(w_k) - \beta_j(w_k)]^2,$

or the unweighted mean square error (UMSE) with all $a_j = 1$, where
$\{w_k, k = 1, \ldots, n_{\mathrm{grid}}\}$ are the grid points at which the functions $\beta(\cdot)$ are estimated. In
the following examples, the Gaussian kernel will be used, $n_{\mathrm{grid}} = 200$ and,
for WMSE, $a_j$ is the reciprocal of the sample variance of $\{\beta_j(w_k)\}$.
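
The criterion (5.1) is straightforward to compute once the estimates and the true functions are tabulated on the grid. The sketch below is illustrative only: it assumes the weights use the sample variance (with divisor $n_{\mathrm{grid}} - 1$) of the true function values over the grid, and the names `sample_variance` and `mse_criterion` are hypothetical. A true function that is constant over the grid has zero variance, so the weighted version is undefined for it.

```python
def sample_variance(values):
    """Unbiased sample variance (divisor len(values) - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def mse_criterion(beta_hat, beta_true, weighted=True):
    """WMSE (weighted=True) or UMSE (weighted=False) over a common grid.

    beta_hat[j][k] and beta_true[j][k] hold the j-th coefficient function
    evaluated at the k-th grid point.
    """
    n_grid = len(beta_true[0])
    total = 0.0
    for est, truth in zip(beta_hat, beta_true):
        a_j = 1.0 / sample_variance(truth) if weighted else 1.0
        total += a_j * sum((e - t) ** 2 for e, t in zip(est, truth))
    return total / n_grid


# Toy illustration with two coefficient functions on a three-point grid.
beta_true = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]
beta_hat = [[0.1, 1.1, 2.1], [0.2, 2.2, 4.2]]
wmse = mse_criterion(beta_hat, beta_true, weighted=True)
umse = mse_criterion(beta_hat, beta_true, weighted=False)
```

The weights put coefficient functions with very different ranges on a comparable scale, so that no single component dominates the criterion.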

Example 1. We first consider the varying-coefficient model
$\lambda(t) = 4t^3 \exp\{b(Z_1(t), Z_2, W)\}$ with

$b(Z_1, Z_2, W) = 0.5 W(1.5 - W) Z_1 + \sin(2W) Z_2 + 0.5\{\exp(W - 1.5) - \exp(-1.5)\},$

Fig. 1. Simulation results for Example 1. (a) Boxplots for the distribution for the WMSE
over the 200 replications, using the three bandwidths $h = 0.2, 0.5, 1$ (from left to right). (b),
(c) and (d) Typical estimates of $\beta_1(\cdot)$, $\beta_2(\cdot)$ and $g(\cdot)$ with bandwidth $h = 0.2$ (solid line,
true function; dashed line, one-step LPLE, i.e., OS).

where $W$ is a random variable uniformly distributed on $[0, 3]$, the covariate
$Z_1(t)$ is time-dependent, defined as $Z_1(t) = (Z_1/4) I(t \le 1) + Z_1 I(t > 1)$, and
$Z_1$ and $Z_2$ are jointly normal with correlation 0.5, each with mean 0 and
standard deviation 5. The censoring random variable $C$ given $(Z_1, Z_2, W)$ is
distributed uniformly on $[0, a(Z_1, Z_2, W)]$, where

$a(Z_1, Z_2, W) = c_1 I(b(Z_1, Z_2, W) > b_0) + c_2 I(b(Z_1, Z_2, W) \le b_0),$

with $b_0$ being the mean function of $b(Z_1, Z_2, W)$. The constants $c_1 = 0.8$ and
$c_2 = 20$ are chosen so that about 30-40% of data are censored in each region
of the function $a(\cdot)$.

We have conducted 200 simulations with sample size 300. Figure 1(a)
depicts the distribution for the WMSE over the 200 replications, using the
three bandwidths $h = 0.2, 0.5, 1$. The initial value is chosen at grid points
$w_{20}, w_{60}, w_{100}, w_{140}$ and $w_{180}$ by the local partial-likelihood estimator just
mentioned in Section 3.1. It is evident that the performances of the one-step
local partial-likelihood estimator (one-step LPLE) and local partial-likelihood
estimator (LPLE) are comparable for a wide range of bandwidths.
Figure 1(b)-(d) presents estimates of the coefficient functions from a typical
sample (attaining the median WMSE performance) with $h = 0.2$.

We now test the accuracy of our standard error formula given in Section 3.3.
The standard deviations, denoted by SD in Table 1, of 200 estimated
$\hat\beta_1(w_0)$, $\hat\beta_2(w_0)$ and $\hat g'(w_0)$, based on 200 simulations, can be regarded
as the true standard errors. The average and the standard deviation of 200
estimated standard errors, denoted by SE_ave and SE_std, summarize the overall
performance of the standard error formula. Table 1 presents the results
at the points $w_0 = 0.3, 0.75, 1.5, 2.25$ and 2.7, which correspond to the 10th,
25th, 50th, 75th and 90th percentiles of the distribution of $W$. The performance
of the standard error formula is quite satisfactory.

Example 2. In the following examples, we evaluate the performance of 
the proposed variable selection method. Samples of size 300 were simulated 
from the hazard regression model 

$\lambda(t) = \exp\Bigl\{ \sum_{j=1}^{4} Z_j \beta_j(W) + g(W) \Bigr\},$

where $\beta_1(w) = 3(w - 2)^2$, $\beta_2(w) = 4\cos(\ldots)$ and $\beta_3(w) = \beta_4(w) = g(w) = 0$.
The covariates $Z_1, Z_2, Z_3$ and $Z_4$ are jointly normal, all with mean 0 and
variance 2, and pairwise correlation 0.6. They are independent of W, which 
is uniformly distributed on [0,3]. The censoring time follows the uniform 
distribution on [0, 7] so that about 30-40% of the data were censored. The 
kernel function is Gaussian. 

The performance of the proposed variable selection technique is compared 
with that of the maximum local partial-likelihood estimator from the full 
model and from the oracle estimator, which is based on the model with only 

Table 1
True and estimated standard errors using bandwidth h = 0.2 for Example 1

        $\hat\beta_1(w_0)$           $\hat\beta_2(w_0)$           $\hat g'(w_0)$
w_0     SD      SE_ave (SE_std)    SD      SE_ave (SE_std)    SD      SE_ave (SE_std)
0.30    0.0606  0.0573 (0.0098)    0.0655  0.0479 (0.0111)    0.3831  0.3735 (0.0492)
0.75    0.0458  0.0479 (0.0076)    0.0579  0.0337 (0.0079)    0.2779  0.2967 (0.0354)
1.50    0.0340  0.0414 (0.0058)    0.0473  0.0236 (0.0043)    0.1910  0.2457 (0.0258)
2.25    0.0303  0.0343 (0.0046)    0.0282  0.0197 (0.0018)    0.1873  0.1602 (0.0228)
2.70    0.0429  0.0385 (0.0053)    0.0321  0.0222 (0.0027)    0.2491  0.1474 (0.0178)

Fig. 2. Boxplots (Oracle, SCAD, Full) for the distribution of the UMSE over the 200
replications, using bandwidths $h = 0.3$ and A = 0.3.

covariates $Z_1$ and $Z_2$. Figure 2 depicts the distribution for the UMSE over
the 200 replications, using bandwidths $h = 0.3$ and $\lambda = 0.3$. It is evident
that the proposed variable selection procedure outperforms the maximum
local partial-likelihood estimator and performs comparably with the oracle
estimator.

Using the majority voting (50%) rule, the variables $Z_3$, $Z_4$ and $g(W)$ were
simultaneously deleted 98.5% of the time among 200 simulations, and using
a 60% thresholding level, the variables $Z_3$, $Z_4$ and $g(W)$ were simultaneously
deleted 92% of the time. Hence, only variables $Z_1$ and $Z_2$ remain. Their
estimated coefficients are depicted in Figure 3 for a typical sample.
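A voting rule of this kind can be sketched as follows. The sketch assumes that the penalized coefficient functions have been evaluated on a grid of $w$ values and that a variable is deleted when the fraction of grid points at which its estimate is zero exceeds the threshold; the function name `select_variables` and this exact reading of the rule are assumptions made for illustration.

```python
import numpy as np

def select_variables(coef_grid, threshold=0.5, tol=0.0):
    """Apply a voting rule to penalized estimates on a grid.

    coef_grid : array of shape (n_grid, n_vars); entry (k, j) is the
                estimate of the j-th coefficient function at grid point k.
    Returns (keep, zero_fraction), where keep[j] is False when variable j
    is estimated as zero at more than `threshold` of the grid points.
    """
    coef_grid = np.asarray(coef_grid, dtype=float)
    zero_fraction = np.mean(np.abs(coef_grid) <= tol, axis=0)
    keep = zero_fraction <= threshold
    return keep, zero_fraction
```

With `threshold=0.5` this corresponds to majority voting; a 60% thresholding level would use `threshold=0.6`.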



5.2. Data analysis. The proposed approaches are now applied to the 
nursing home data set analyzed by Morris, Norton and Zhou [29], where a 
full description of this data set is given. The data are from an experiment 
sponsored by the National Center for Health Services Research during 1980- 
1982 that involved 36 for-profit nursing homes in San Diego, California, with 
a sample of size 1601. 

Fig. 3. Panels: (a) estimated curve of $\beta_1$; (b) estimated curve of $\beta_2$.

The study was designed to evaluate the effects of different financial incentives on, among other things, the duration of stay. This motivated Morris, Norton and Zhou [29] to take days $T$ in the nursing home as the response variable. They used the model

$$\lambda(t,x) = \lambda_0(t)\exp\Bigl(\sum_{j=1}^{7} x_j\beta_j\Bigr),$$

where $x_1$ is a treatment indicator, being 1 if treated at a nursing home and 0 otherwise; $x_2$ is a gender variable (1 for males and 0 for females); $x_3$ is a marital status indicator (1 if married and 0 otherwise); $x_4$, $x_5$, $x_6$ are three binary health status indicators that correspond to the best health to the worst health; $x_7$ is age, which ranges from 65 to 104. Morris, Norton and Zhou [29] fitted the Cox model with three parametric baseline hazard models and one nonparametric baseline hazard model to this data set. Their model does not include any possible interactions between age and other variables. To explore possible interaction, Fan and Li [17] added interaction terms such as $x_7x_1, x_7x_2, \ldots$ in the initial model.

Fig. 4. Panels: (a) $\beta$ for gender; (b) $\beta$ for health 1; (c) $\beta$ for health 2.

With our newly developed technique, we can fit the more general model

$$\lambda(t,x) = \lambda_0(t)\exp\Bigl(\sum_{j=1}^{6}\beta_j(x_7)x_j + g(x_7)\Bigr).$$

This permits us to examine how different age groups interact with covariates such as treatment, gender and marital status. In fact, as age increases, elderly people would expect to stay at nursing homes longer. Therefore, it is natural to introduce the term $g(x_7)$, the varying intercept.

The local partial-likelihood method was applied to the data set with bandwidth $h = 15$, which was chosen by $K$-fold cross-validation [8, 25] to minimize the prediction error $\sum_{i}\int_0^\tau (N_i(t) - \widehat{E}N_i(t))^2\,d\{\sum_{k=1}^{n}N_k(t)\}$, where $\widehat{E}N_i(t) = \int_0^t Y_i(u)\exp\{\hat\beta(W_i)^TZ_i(u) + \hat g(W_i)\}\hat\lambda_0(u)\,du$ is the estimate of the expected failure number up to time $t$. We chose $K = 20$. Here, examination of the resulting estimated coefficient functions and their 95% confidence bands (not presented here) suggests that the treatment and marital status variables are
not very significant. We therefore applied the variable selection technique to the data with $\lambda = 0.02$ and bandwidth $h = 15$. The coefficient function for the treatment effect was estimated as zero at 89.5% of grid points, the coefficient function for marital status was estimated as zero at 97.9% of grid points and they were simultaneously estimated as zero at 87.5% of grid points. Thus, the treatment indicator and marital status variables were deleted. In other words, there is no significant treatment effect even when the more objective model (a less restrictive model than in [29] and [18]) is used. Applying the local log partial-likelihood method (2.4) to the remaining five variables, we obtained the estimated coefficient functions shown in Figure 4. These functions depict the extent to which the gender effect and the health effect vary with age, and indicate clearly that the risk of staying at a nursing home depends on age.
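To make the fitting criterion concrete, the following is a minimal sketch of a local log partial likelihood at a point $w_0$. It assumes the usual local linear form, in which $\beta(W)$ and $g(W)$ are replaced by first-order expansions around $w_0$ and each subject is weighted by $K_h(W_i - w_0)$; the exact expression of (2.4) is not reproduced in this section, so this form, the time-independent covariates, the Gaussian kernel and the function names are all assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def local_log_partial_likelihood(xi, X, delta, Z, W, w0, h):
    """Kernel-weighted log partial likelihood at w0 (illustrative form).

    xi stacks (beta(w0), beta'(w0), g'(w0)) and has length 2p + 1.
    X, delta : observed times and failure indicators, shape (n,).
    Z : time-independent covariates, shape (n, p).  W : exposure, shape (n,).
    """
    n, p = Z.shape
    d = W - w0
    # Local design X_i^* = (Z_i, Z_i (W_i - w0), W_i - w0).
    Xstar = np.column_stack([Z, Z * d[:, None], d])
    kw = gaussian_kernel(d / h) / h            # K_h(W_i - w0)
    lin = Xstar @ np.asarray(xi, dtype=float)
    total = 0.0
    for i in np.flatnonzero(delta):            # loop over observed failures
        at_risk = X >= X[i]                    # risk set at time X_i
        log_denom = np.log(np.sum(kw[at_risk] * np.exp(lin[at_risk])))
        total += kw[i] * (lin[i] - log_denom)
    return total / n
```

The function is concave in `xi`, so a generic Newton or quasi-Newton routine applied to its negative would locate the maximizer; evaluating it over a grid of `w0` values traces out the estimated coefficient functions.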

APPENDIX A: PROOFS 

A.1. Notation and conditions. For easy reference, we collect a set of notation and conditions to be used. Let $(\Omega, \mathcal{F}, P_{(\theta,\lambda_0)})$ be a family of complete probability spaces provided with a history $F = \{\mathcal{F}_t\}_t$ for an increasing right-continuous filtration $\mathcal{F}_t \subset \mathcal{F}$. We assume that $W_i$ is $\mathcal{F}_t$-measurable, and $N_i(u)$ and $Z_i(u)$ are $F$-adapted. Write $\mathcal{F}_t = \sigma\{X_i \le u, Z_i(u), W_i, Y_i(u), i = 1,2,\ldots,n, 0 \le u \le t\}$ and $M_i(t) = N_i(t) - \int_0^t\lambda_i(u)\,du$, $i = 1,2,\ldots,n$. Obviously, $M_i(t)$ is an $\mathcal{F}_t$ martingale.

Let $\|\cdot\|$ denote the $L_2$-norm and let $\|\cdot\|_J$ be the sup-norm of a function or a process on a set $J$. The support of the random variable $W$ is denoted by $\mathcal{W}$. For a compact subset $J_W$ of $\mathcal{W}$, we define its neighborhood set $J_{W,\varepsilon}$ as

$$J_{W,\varepsilon} = \Bigl\{w: \inf_{w_0\in J_W}|w - w_0| \le \varepsilon\Bigr\}$$

for some $\varepsilon > 0$.

To facilitate technical arguments, we will reparametrize the local partial likelihood (2.4) via the transformation $\zeta = H(\xi - \xi_0)$. Hence, the logarithm of the local partial-likelihood function is

$$
\begin{aligned}
\ell_n(t,\zeta) &= \ell_n(H^{-1}\zeta + \xi_0, t) \\
&= \frac{1}{n}\sum_{i=1}^{n}\int_0^t K_h(W_i - w_0)\,[\zeta^TU_i^*(u) + \xi_0^TX_i^*(u) - \log S_{n0}(u,\zeta,w_0)]\,dM_i(u) \\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\int_0^t K_h(W_i - w_0)\,[\zeta^TU_i^*(u) + \xi_0^TX_i^*(u) - \log S_{n0}(u,\zeta,w_0)] \\
&\qquad\times Y_i(u)\exp(\beta_0(W_i)^TZ_i(u) + g_0(W_i))\lambda_0(u)\,du,
\end{aligned}
$$

where $U_i^*(u) = H^{-1}X_i^*(u)$ and

$$S_{nk}(u,\zeta,w_0) = \sum_{i=1}^{n}K_h(W_i - w_0)Y_i(u)\exp(\zeta^TU_i^*(u) + \xi_0^TX_i^*(u))(U_i^*(u))^{\otimes k},\qquad k = 0,1,2.$$

Furthermore, for each $u \in [0,\tau]$ and $k = 0,1,2$, we write $\ell_n(\zeta) = \ell_n(\tau,\zeta)$ and define

$$S_{nk}^*(u,\theta,w_0) = \sum_{i=1}^{n}K_h(W_i - w_0)Y_i(u)\exp(\beta^T(W_i)Z_i(u) + g(W_i))(U_i^*(u))^{\otimes k},$$

where $\xi(\cdot) = (\beta^T(\cdot), \beta'(\cdot)^T, g'(\cdot))^T$, $\theta(\cdot) = (\beta^T(\cdot), g(\cdot))^T$ and $w_0 \in J_W$.

Let f(wo) be the density of the random variable W. In addition to the 
notation introduced before Theorem 2, we also define, for wq G Jw,e, 

s* (u,6,w ) = f(w )E[p(u,Z(u),w )\W = w ], 

sl(u,8,w ) = f(w )E[p(u,Z( U ),w )(Z T (u),0,0) T \W = w ], 



s* 2 (u,G,wq) = f(w )E 



p(u, Z(u),wq) ex.p((3(w ) T Z(u) + g(w )) 
Z(u)Z T (u) 



W = Wq 



Z{u)Z T {u)p 2 , Z(u)p 2 

Z T (u)p 2 , M2 

and 

s fc (u,C,iwo) 

= /M y E[P(uMu),w )^(C^ ,Z(u),y)^u(y)® k \W = w }K(y)dy, 
where jfe = 0, 1,2, R u (y) = (Z T (u), Z T (u)y, y) T and 

*(C,4 0) z>2/) = exp |c T R«(y) + £o 

To facilitate notation, the arguments $\theta_0(w) = (\beta_0^T(w), g_0(w))^T$, $\xi_0(w)$, $\zeta = 0$ and $w_0$ are omitted in $S_{nk}^*(t,\theta,w_0)$, $S_{nk}(t,\zeta,w_0)$, $s_k^*(t,\theta,w_0)$ and $s_k(t,\zeta,w_0)$ whenever there is no ambiguity. For example,

$$S_{nk}^*(t) = S_{nk}^*(t,w_0) = S_{nk}^*(t,\theta_0,w_0),\qquad s_k^*(t) = s_k^*(t,w_0) = s_k^*(t,\theta_0,w_0),$$

$$S_{nk}(t) = S_{nk}(t,w_0) = S_{nk}(t,0,w_0),\qquad s_k(t) = s_k(t,w_0) = s_k(t,0,w_0),$$

$$S_{nk}(t,\zeta) = S_{nk}(t,\zeta,w_0),\qquad s_k(t,\zeta) = s_k(t,\zeta,w_0).$$
Condition A.

1. The kernel function $K \ge 0$ is a bounded, symmetric density function with compact support.

2. The functions $\beta_0(\cdot)$ and $g_0(\cdot)$ have continuous second-order derivatives around the point $w_0$.

3. The density function $f(\cdot)$ of $W$ is continuous at the point $w_0$ and $f(w_0) > 0$.

4. The conditional probability $P(u, Z(u), \cdot)$ is equicontinuous at $w_0$ and the covariate $Z(u)$ is continuous.

5. We have $nh \to \infty$ and $nh^5$ is bounded.

6. We have $\int_0^\tau\lambda_0(t)\,dt < \infty$.

7. (Lindeberg condition) There exists $\delta > 0$ such that

(n/i)" 1 / 2 sup |Z l (t)|y i (t)/(/3^(^ )Z i (t) > -5|Zi(t)|) 0, 
te[o,r],ieJV r 

where $\mathcal{N} = \{1, 2, \ldots, n\}$.

8. (Asymptotic variance) The matrix a2 — Jq ai ^^}^ oAo(n) is positive 
definite at the point i«o an d the matrix ^) is nonsingular at the point 
Wo- 

Condition A will be used to derive the pointwise convergence properties of $\hat\xi$ and its asymptotic normality. Conditions A.1-A.5 are similar to those in [16] and Conditions A.7-A.8 are similar to Conditions C and D of [2]. Condition A.7 seems complicated, but can be easily verified in some important cases. For example, when the covariates $Z$ are bounded, the condition is always satisfied; if the covariates $Z$ are bounded by a random variable that has a bounded $r$th moment for some constant $r > 2$, the condition also holds. Other cases can be found in [2]. To derive the uniform consistency result, Condition A needs to be strengthened as follows.

Condition B.

1. The kernel function $K \ge 0$ is a bounded, symmetric density function with compact support.

2. The functions $\beta_0(\cdot)$ and $g_0(\cdot)$ have continuous second-order derivatives on $J_{W,\varepsilon}$.

3. The conditional probability $P(u, Z(u), w)$ is equicontinuous in the arguments $(u,w)$ on $[0,\tau] \times J_{W,\varepsilon}$.

4. The compact set $J_W \subset \mathcal{W}$ has the property $\inf_{w\in J_{W,\varepsilon}}f(w) > 0$ for some $\varepsilon > 0$.

5. The covariate process Z(u) has continuous sample paths in a subset Z of 
the continuous function space, and Jq Xo(t) dt < oo and J w < 00 • 
6. The function $s_0(t,\theta,w_0)$ is bounded away from 0 on the product space $[0,\tau] \times C \times J_{W,\varepsilon}$, that is,

$$\inf_{t\in[0,\tau]}\ \inf_{(\beta^T,g)\in C}\ \inf_{w_0\in J_{W,\varepsilon}} s_0(t,\theta,w_0) > 0$$

and

$$\sup_{t\in[0,\tau]}\ \sup_{(\beta^T,g)\in C} E|Z(t)|^k\exp(\beta^TZ(t) + g) < \infty,$$

where $C \subset R^{p+1}$.

7. We have $nh/\log n \to \infty$ and $nh^5$ is bounded.

8. (Asymptotic variance) The matrix &2 — Jq ai ^~7^j ( M ) is positive 
definite for any wo G JvKe and the matrix (ff. ai ) is nonsingular for every 

A.2. Proof of main results. Let

$$C_n(t) = n^{-1}\sum_{i=1}^{n}Y_i(t)\,g(W_i, (W_i - w_0)/h, Z_i(t))\,K_h(W_i - w_0)$$

for a function $g(\cdot,\cdot,\cdot)$.
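The kernel-weighted average $C_n(t)$ is simple to compute, and a small sketch may help fix ideas. It assumes time-independent covariates (so that $Z_i(t) = Z_i$), the at-risk indicator $Y_i(t) = I(X_i \ge t)$, and a Gaussian kernel; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_average(t, X, W, Z, g, w0, h, kernel=gaussian_kernel):
    """C_n(t) = n^{-1} sum_i Y_i(t) g(W_i, (W_i - w0)/h, Z_i) K_h(W_i - w0).

    g must accept arrays (W, u, Z) and return one value per subject.
    """
    Y = (X >= t).astype(float)        # at-risk indicator Y_i(t)
    u = (W - w0) / h                  # rescaled distance to w0
    Kh = kernel(u) / h                # K_h(W_i - w0)
    return np.mean(Y * g(W, u, Z) * Kh)
```

For instance, taking `g = lambda W, u, Z: np.ones_like(W)` gives a kernel estimate of $f(w_0)P(X \ge t \mid W = w_0)$, which is the limit $C(t)$ in Lemma A.1 for that choice of $g$.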

Lemma A.1. Assume that Conditions A.1 and A.4 hold. Suppose that $g(\cdot,\cdot,\cdot)$ is continuous in its three arguments and that $E(g(W,u,Z(t)) \mid W = w_0)$ is continuous at the point $w_0$. If $h \to 0$ in such a way that $nh/\log n \to \infty$, then

$$\sup_{0\le t\le\tau}|C_n(t) - C(t)| \xrightarrow{P} 0,$$

where $C(t) = f(w_0)\int E(Y(t)g(w_0,u,Z(t)) \mid W = w_0)K(u)\,du$.

Proof. It is easy to show that for every $t \in [0,\tau]$,

$$|C_n(t) - C(t)| \xrightarrow{P} 0. \tag{A.1}$$

Now we divide $[0,\tau]$ into $M$ subintervals $[t_{i-1}, t_i]$, $i = 1,2,\ldots,M$, with maximum length $\delta$. Then

$$\max_{1\le i\le M}|C_n(t_i) - C(t_i)| \xrightarrow{P} 0. \tag{A.2}$$

Note that

$$
\begin{aligned}
\sup_{0\le t\le\tau}|C_n(t) - C(t)| &\le \max_{1\le i\le M}|C_n(t_i) - C(t_i)| \\
&\quad + \max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta}|C_n(t) - C(t) - (C_n(t_{i-1}) - C(t_{i-1}))|.
\end{aligned}
\tag{A.3}
$$
The first term on the right-hand side is asymptotically negligible. We now deal with the second term. Write

$$g(W,(W - w_0)/h,Z) = g^+(W,(W - w_0)/h,Z) - g^-(W,(W - w_0)/h,Z),$$

where $g^+(\cdot,\cdot,\cdot)$ and $g^-(\cdot,\cdot,\cdot)$ are the positive part and negative part of $g(\cdot,\cdot,\cdot)$, respectively. Correspondingly, we decompose $C_n(t)$ into $C_n^+(t)$ and $C_n^-(t)$. We only need to show that

$$\max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta}|C_n^+(t) - C_n^+(t_{i-1})| + \max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta}|C^+(t) - C^+(t_{i-1})| \xrightarrow{P} 0 \tag{A.4}$$

and a similar result for $C_n^-(t)$. We now focus on (A.4). It will be shown in Appendix B that

$$\max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta}|C_n^+(t) - C_n^+(t_{i-1})| \xrightarrow{P} 0. \tag{A.5}$$

On the other hand, we have

$$
\begin{aligned}
&\max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta}|C^+(t) - C^+(t_{i-1})| \\
&\quad\le \max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta} f(w_0)\int \bigl|E\{Y(t)[g^+(w_0,u,Z(t)) - g^+(w_0,u,Z(t_{i-1}))] \mid W = w_0\}\bigr|\,K(u)\,du \\
&\qquad + \max_{1\le i\le M}\ \sup_{|t - t_{i-1}|\le\delta} f(w_0)\int E\{I(t_{i-1}\le X < t_i)\,g^+(w_0,u,Z(t_{i-1})) \mid W = w_0\}\,K(u)\,du,
\end{aligned}
\tag{A.6}
$$

which tends to zero as $\delta \to 0$. Hence (A.4) holds. This completes the proof. □



Lemma A.2. Assume that $g(w,u,Z(t))$ is equicontinuous in its arguments $w$ and $u$, and that $E(g(w_0,u,Z(t)) \mid W = w_0)$ is equicontinuous in the argument $w_0$. Under Conditions B.3 and A, we have

$$\sup_{0\le t\le\tau}\ \sup_{w_0\in B}|C_n(t,w_0) - C(t,w_0)| \to 0,$$

where $B$ is a compact set that satisfies $\inf_{w\in B}f(w) > 0$.

The proof of Lemma A.2 is similar to that of Lemma A.1 and is omitted.



Lemma A.3. Let $C$ and $D$ be compact sets in $R^d$ and $R^p$, and let $f(x,\theta)$ be a continuous function in $\theta \in C$ and $x \in D$. Assume that $\theta_0(x)$ is continuous in $x \in D$ and is the unique maximizer of $f(x,\theta)$. Let $\hat\theta_n(x) \in C$ be a maximizer of $f_n(x,\theta)$. If

$$\sup_{\theta\in C,\,x\in D}|f_n(x,\theta) - f(x,\theta)| \to 0,$$

then

$$\sup_{x\in D}|\hat\theta_n(x) - \theta_0(x)| \to 0.$$

The proof of Lemma A.3 can be found in [11].

Lemma A.4. Under Condition A, we have for $k = 0,1,2$,

$$n^{-1}S_{nk}^*(u) = s_k^*(u) + o_p(1),$$

uniformly for $u \in (0,\tau]$, where $s_k^*(u) = s_k^*(u,\theta_0,w_0)$ and

$$\sup_{u\in(0,\tau]}\|n^{-1}S_{nk}^*(u,\theta,w_0) - s_k^*(u,\theta,w_0)\| = o_p(1),$$

where $\theta$ lies in a neighborhood of $\theta_0$ for fixed $w_0$. In addition, we have for each $\zeta$,

$$\sup_{u\in(0,\tau]}\|n^{-1}S_{nk}(u,\zeta,w_0) - s_k(u,\zeta,w_0)\| = o_p(1),$$

where $\zeta$ lies in a neighborhood of 0 for fixed $w_0$. Furthermore, under Condition B, we have

$$\|n^{-1}S_{nk}^* - s_k^*\|_{\mathcal{R}} = o_p(1),$$

where $\mathcal{R} = [0,\tau] \times C \times J_{W,\varepsilon}$ and a similar result holds for $S_{nk}(u,\zeta,w_0)$.

The results of Lemma A.4 can be easily proved along similar lines to the arguments establishing Lemma A.1.

Proof of Theorem 1. The first result of Theorem 1 follows from the first step in the proof of Theorem 2. Now we only prove the second result of Theorem 1. By an argument similar to that in the first step in the proof of Theorem 2, we easily prove from Lemma A.2 that

$$\sup_{t\in[0,\tau]}\ \sup_{\zeta\in C^*}\ \sup_{w_0\in J_W}|\ell_n(t,\zeta) - \ell_n(t,0) - Y(t,\zeta)| \to 0$$

in probability; here $C^*$ is a convex and compact set of $R^{2p+1}$. Therefore, it follows from Lemma A.3 that $\sup_{w_0\in J_W}|\hat\zeta| \to 0$ in probability. The proof is complete. □



Proof of Theorem 2. We first prove that $\sqrt{nh}\,H(\hat\xi(w_0) - \xi_0(w_0))$ is asymptotically normal with mean $h^2e_p\beta_0''(w_0)\mu_2/2$ and covariance $\Sigma(\tau,w_0)$. We divide the proof of the asymptotic normality of $\sqrt{nh}\,H(\hat\xi(w_0) - \xi_0(w_0))$ into three steps. The first step is to show that $H(\hat\xi(w_0) - \xi_0(w_0)) \to 0$ in probability. The second step is to establish the asymptotic normality of the first derivative of the local partial likelihood. The third step is to demonstrate that the Hessian matrix of the local partial-likelihood function converges to a positive definite one. Theorem 2 will then be proved by combining the results in these three steps.

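(a) We first show that $\hat\zeta \to 0$ in probability, where $\hat\zeta = H(\hat\xi - \xi_0)$. It is easy to show that

$$
\begin{aligned}
\ell_n(t,\zeta) - \ell_n(t,0) &= \frac{1}{n}\sum_{i=1}^{n}\int_0^t K_h(W_i - w_0)\Bigl[\zeta^TU_i^*(u) - \log\frac{S_{n0}(u,\zeta)}{S_{n0}(u,0)}\Bigr]\,dM_i(u) \\
&\quad + \frac{1}{n}\int_0^t\Bigl[(S_{n1}^*(u))^T\zeta - \log\frac{S_{n0}(u,\zeta)}{S_{n0}(u,0)}\,S_{n0}^*(u)\Bigr]\lambda_0(u)\,du \\
&:= X_n(t,\zeta) + Y_n(t,\zeta).
\end{aligned}
\tag{A.7}
$$

By Lemma A.1 we obtain that

$$
\begin{aligned}
Y_n(t,\zeta) &= \int_0^t (s_1^*(u))^T\zeta\,\lambda_0(u)\,du - \int_0^t\log\frac{s_0(u,\zeta)}{s_0(u,0)}\,s_0^*(u)\lambda_0(u)\,du + o_p(1) \\
&:= Y(t,\zeta) + o_p(1).
\end{aligned}
$$

In Appendix B, we will show that $Y(t,\zeta)$ is a strictly concave function in $\zeta$ and has maximum value at $\zeta = 0$. The process $X_n(t,\zeta)$ is a local square integrable martingale with the square variation process

$$
\begin{aligned}
D_n(t) &= \langle X_n(\cdot,\zeta), X_n(\cdot,\zeta)\rangle(t) \\
&= \frac{1}{n^2}\sum_{i=1}^{n}\int_0^t K_h^2(W_i - w_0)\Bigl[\zeta^TU_i^*(u) - \log\frac{S_{n0}(u,\zeta)}{S_{n0}(u,0)}\Bigr]^2 \\
&\qquad\times Y_i(u)\exp(\beta_0(W_i)^TZ_i(u) + g_0(W_i))\lambda_0(u)\,du.
\end{aligned}
$$

It follows from Lemma A.1 that

$$EX_n^2(t,\zeta) = ED_n(t) = O((nh)^{-1}),\qquad 0 \le t \le \tau.$$

Hence, we have that

$$\ell_n(t,\zeta) - \ell_n(t,0) = Y(t,\zeta) + O_p((nh)^{-1/2}).$$

Obviously, $\ell_n(t,\zeta) - \ell_n(t,0)$ is strictly concave in $\zeta$ with the maximizer $\hat\zeta$. By the concavity lemma it follows that $\hat\zeta \to 0$, the maximizer of $Y(t,\zeta)$, in probability.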

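(b) We now show that $\sqrt{nh}\,(\ell_n'(\tau,0) - B_n(\tau,w_0))$ is asymptotically normal with mean zero and covariance $\Pi(\tau,w_0)$, where the definitions of $B_n(\tau,w_0)$ and $\Pi(\tau,w_0)$ can be found below.

Observe that

$$
\begin{aligned}
\ell_n'(\tau,0) &= \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau K_h(W_i - w_0)\Bigl[U_i^*(u) - \frac{S_{n1}(u,w_0)}{S_{n0}(u,w_0)}\Bigr]\,dM_i(u) \\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau K_h(W_i - w_0)\Bigl[U_i^*(u) - \frac{S_{n1}(u,w_0)}{S_{n0}(u,w_0)}\Bigr] \\
&\qquad\times\exp(\beta_0(W_i)^TZ_i(u) + g_0(W_i))\,Y_i(u)\lambda_0(u)\,du.
\end{aligned}
$$

Let us denote the above two terms, respectively, by $I_1(\tau,0)$ and $I_2(\tau,0)$. We first deal with $I_2(\tau,0)$. By Taylor expansion we have

$$
\begin{aligned}
&\exp(\beta_0(W_i)^TZ_i(u) + g_0(W_i)) - \exp(\xi_0^TX_i^* + g_0(w_0)) \\
&\quad = \tfrac{1}{2}\exp(\xi_0^TX_i^* + g_0(w_0))\,[\beta_0''(w_0)^TZ_i(u) + g_0''(w_0)]\,(W_i - w_0)^2\,(1 + O_p(h)).
\end{aligned}
\tag{A.8}
$$

Note that, because $\sum_{i=1}^{n}K_h(W_i - w_0)Y_i(u)\exp(\xi_0^TX_i^*)[U_i^*(u) - S_{n1}(u)/S_{n0}(u)] = 0$ by the definition of $S_{nk}(u)$,

$$
\begin{aligned}
I_2(\tau,0) &= \frac{1}{n}\sum_{i=1}^{n}\int_0^\tau K_h(W_i - w_0)\Bigl[U_i^*(u) - \frac{S_{n1}(u)}{S_{n0}(u)}\Bigr] \\
&\qquad\times[\exp(\beta_0(W_i)^TZ_i(u) + g_0(W_i)) - \exp(\xi_0^TX_i^* + g_0(w_0))]\,Y_i(u)\lambda_0(u)\,du.
\end{aligned}
$$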
Then it follows from Lemmas A.l and A. 4 that 



1 

2^ 



i=i 



E / ^(wi-itfo; 



u?(«) 



*o(«). 

x Y^exp^X* +g (wo))lP' \w ) T Z l (u) + <?oK))] 
x (Wi - w ) 2 \ {u) du(l + o p {h)) 



'Z(«)W\ 

Z(u) M 3 
^3 / 



s*(u) 



p(u,Z(u),w ) 
x [^(w ) T Z(u)+g > / (w )]\W = w



x Ao(it) du(l + O p (h)), 

where s^(u) = s^,(u, Oq, wq) for k = 0, 1, 2. Since K(-) is a symmetric function, 
which implies /i3 = 0, simple algebra shows that 



X / £ 

Jo 



'Z(u) - ai(ti)/a (n)' 





(A.9) 



'/3 Vor 



xp(«,Z(n),^ )(Z J ,0,l) 



,j"(i«o) 



dA (w) 



x (l + O p (/i)) 



i/ l 2 /i 2 e P r- 1 /3 '(u;o)(l + O p (/ l )). 



Let us denote the term in (A.9) by $B_n(\tau,w_0)$.

We now derive the asymptotic normality of the term $I_1(\tau,0)$. Let $I_1^*(t) = \sqrt{nh}\,I_1(t,0)$. Then



S n o(u) 



n i=1 Jo 

x Y,(u) exp(/3 (^) T Z 4 (n) + gd(Wi))A («) d«. 
By Lemma A.l and using Conditions A.l and A. 8, it can be shown that 
n(r, too) 

= ( Jm^iJ,iT>(r) 

= /(«*>) 



(A.10) x / E 
Jo 



'(Z(u)- ai (u)/Mu)f % 










ft 







Z(u)Z J (u> 2 Z(u)^ 2 

Z T {u)v- 2 V2 



x p(u,Z(u),iw )| VF = w 



dAo(tt 



T-V 











a 2 ^2 ai^ 2 



a^ 2 a ^2, 
By Condition A.7 and a proof similar to that of Andersen and Gill [2], it is easy to prove that the Lindeberg condition for the process $I_1^*(t)$ holds. By the martingale central limit theorem, we derive that $I_1^*(t)$ is asymptotically normal with mean zero and covariance $\Pi(t,w_0)$. Hence

$$\sqrt{nh}\,\{\ell_n'(0) - B_n(\tau,w_0)\} \to N(0,\Pi(\tau,w_0)). \tag{A.11}$$

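(c) We will show that the second derivative of the logarithm of the local partial-likelihood function converges to a finite constant matrix. Since $\hat\zeta \to 0$ in probability, by the mean-value theorem we have that

$$\ell_n''(\hat\zeta) = \ell_n''(0) + o_p(1). \tag{A.12}$$

Since $s_k^*(u) = s_k(u)\exp(g_0(w_0))$, $k = 0,1,2$, from Lemma A.4, we can obtain

$$\ell_n''(0) = -\frac{1}{n}\int_0^\tau\sum_{i=1}^{n}K_h(W_i - w_0)\,\frac{s_2^*(u)s_0^*(u) - s_1^*(u)(s_1^*(u))^T}{(s_0^*(u))^2}\,dN_i(u) + o_p(1).$$

Write $F_w(u) = P(X \le u, \Delta = 1 \mid W = w)$ and its corresponding empirical conditional measure,

$$\hat F_w(u) = \frac{1}{n}\sum_{i=1}^{n}K_h(W_i - w)\,I(X_i \le u, \Delta_i = 1).$$

By kernel smoothing techniques, we easily prove that

$$
\begin{aligned}
\ell_n''(0) &= -\int_0^\tau\frac{s_2^*(u)s_0^*(u) - s_1^*(u)(s_1^*(u))^T}{(s_0^*(u))^2}\,dF_{w_0}(u) + o_p(1) \\
&= -A(\tau,w_0) + o_p(1),
\end{aligned}
\tag{A.13}
$$

where

$$A(\tau,w_0) = \int_0^\tau\frac{s_2^*(u)s_0^*(u) - s_1^*(u)(s_1^*(u))^T}{(s_0^*(u))^2}\,dF_{w_0}(u).$$

It is easy to show that $A(\tau,w_0)$ is positive definite.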

(d) Combining the results in steps (a), (b) and (c), we can establish the asymptotic normality of $\sqrt{nh}\,H(\hat\xi(w_0) - \xi_0(w_0))$. In fact, since $\hat\zeta$ maximizes $\ell_n(\zeta)$, by Taylor expansion around 0, we have

$$0 = \ell_n'(\hat\zeta) = \ell_n'(0) + \ell_n''(\zeta^*)\hat\zeta,$$

where $\zeta^*$ lies between 0 and $\hat\zeta$. Hence $\zeta^* \to 0$ in probability. It follows from (A.13) that

$$\hat\zeta - A(\tau,w_0)^{-1}B_n(\tau,w_0) = -(\ell_n''(\zeta^*))^{-1}(\ell_n'(0) - B_n(\tau,w_0)) + o_p(1).$$

Combining (A.11) with (A.13), by Slutsky's theorem we obtain that

$$\sqrt{nh}\,(\hat\zeta - A(\tau,w_0)^{-1}B_n(\tau,w_0)) \to N(0, A^{-1}(\tau,w_0)\Pi(\tau,w_0)(A^{-1}(\tau,w_0))^T).$$

Now we simplify the matrix A(r, Wo). Obviously, by a simple calculation 
we have 

'Z(u)Z T (u) 

Z(u)Z T (u)fi 2 Z(u)fi 2 

Z T {u)H2 



s* 2 (u) = f(w )E 



/'2 



(A.14) 



x p(u,Z(u), w )\W = w Q 



a 2 (u) 
a 2 (u)^2 ai(u)n2 | • 
af(n)^ 2 ao(u)n2, 

Similarly, we obtain that 

/ai(t*)af (it) 
(A.15) ( s *( n ))^ 2 = 

V o oo, 

Note that Sq(u) = &q(u). Hence it follows from (A.14) and (A.15) that 



(A.16) 



T 1 

A(t,w ) = ( a 2 ^ 2 ai/x 2 
aj nz a /j, 2 , 



Hence, the asymptotic bias of the estimator C( w o) is 
b(r,Wo) = A~ 1 (T,w )B n (T,w ) 
= h 2 e p £ \w )n 2 /2 

and the asymptotic covariance is 

S(r, w ) = A~ 1 (t , w )n(T, w ) (A" 1 (r, u; )) T 

Ti/ o 

T ( a 2 ai \ _ 2 
/j, 2 i/ 2 

V a i a o ' 

r o 

T Qfi2 2 V 2 

This completes the proof. □ 
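Proof of Theorem 3. We have shown from (A.13) that

$$\ell_n''(\zeta^*) = -A(\tau,w_0) + o_p(1) \tag{A.17}$$

for any $\zeta^*$ between zero and $\hat\zeta = H(\hat\xi - \xi_0)$. By Theorem 2, $\hat\zeta = O_p(h^2 + (nh)^{-1/2})$. Thus, for any $\zeta^* = O_p(h^2 + (nh)^{-1/2})$, (A.17) holds. By Taylor expansion of $\ell_n'(\hat\zeta_0)$ at $\zeta_0 = 0$, we have

$$\ell_n'(\hat\zeta_0) = \ell_n'(\zeta_0) + \ell_n''(\zeta^*)(\hat\zeta_0 - \zeta_0), \tag{A.18}$$

where $\hat\zeta_0 = H(\hat\xi_0 - \xi_0)$ and $\zeta^* = H(\xi^* - \xi_0)$, in which $\xi^*$ lies between $\hat\xi_0$ and $\xi_0$.

By definition of the one-step estimator and (A.18), we have that

$$\hat\zeta_{os} - \zeta_0 = (\hat\zeta_0 - \zeta_0) - (\ell_n''(\hat\zeta_0))^{-1}\ell_n'(\hat\zeta_0).$$

Using (A.17), we have

$$
\begin{aligned}
\hat\zeta_{os} - \zeta_0 &= \{I - (\ell_n''(\hat\zeta_0))^{-1}\ell_n''(\zeta^*)\}(\hat\zeta_0 - \zeta_0) - (\ell_n''(\hat\zeta_0))^{-1}\ell_n'(\zeta_0) \\
&= -(\ell_n''(\hat\zeta_0))^{-1}\ell_n'(\zeta_0) + o_p(\hat\zeta_0 - \zeta_0) \\
&= -(\ell_n''(\hat\zeta_0))^{-1}[\ell_n'(\zeta_0) - B_n(\tau,w_0)] - (\ell_n''(\hat\zeta_0))^{-1}B_n(\tau,w_0) + o_p((nh)^{-1/2} + h^2).
\end{aligned}
$$

It follows from (A.11) and (A.13) that $\hat\zeta_{os}$ has the same asymptotic distribution as the maximum local partial-likelihood estimator. This yields Theorem 3. □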
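The one-step estimator used in this proof is a single Newton-Raphson update from an initial value. A minimal sketch is given below; the function name `one_step` and the callable arguments `score` and `hessian` (returning the first and second derivatives of the objective) are hypothetical.

```python
import numpy as np

def one_step(zeta_init, score, hessian):
    """One Newton-Raphson update: zeta_init - hessian(zeta_init)^{-1} score(zeta_init)."""
    zeta_init = np.asarray(zeta_init, dtype=float)
    g = score(zeta_init)        # first derivative at the initial value
    H = hessian(zeta_init)      # second derivative at the initial value
    # Solve H x = g rather than forming the inverse explicitly.
    return zeta_init - np.linalg.solve(H, g)
```

For an exactly quadratic concave objective the update lands on the maximizer from any starting point. For the local partial likelihood, which is only approximately quadratic near its maximizer, the proof above shows that one update from an initial value of order $O_p(h^2 + (nh)^{-1/2})$ already attains the limiting distribution of the fully iterated estimator.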

Proof of Theorem 4. By the same argument as that of Lemma A.1, we have

$$\sup_{t\in[0,\tau]}\ \sup_{\|\theta - \theta_0\|\le\|\hat\theta - \theta_0\|} n^{-1}|A_n(t,\theta) - A_n(t,\theta_0)| \to 0 \tag{A.19}$$

in probability, where

$$A_n(t,\theta) = \sum_{i=1}^{n}I(W_i \in J_W)\,Y_i(t)\exp\{\beta^T(W_i)Z_i(t) + g(W_i)\},$$

where $\theta = (\beta^T(\cdot), g(\cdot))^T$.

By definition of $\hat\Lambda_0(t)$, we have

$$
\begin{aligned}
\hat\Lambda_0(t) - \Lambda_0(t) &= \int_0^t\Bigl\{\frac{1}{A_n(\hat\theta)} - \frac{1}{A_n(\theta_0)}\Bigr\}\,dN_n + \int_0^t\Bigl\{\frac{dN_n}{A_n(\theta_0)} - d\Lambda_0\Bigr\} \\
&= -\int_0^t\frac{A_n(\hat\theta) - A_n(\theta_0)}{A_n(\hat\theta)}\,d\Lambda_0 - \int_0^t\frac{A_n(\hat\theta) - A_n(\theta_0)}{A_n(\hat\theta)A_n(\theta_0)}\,dM_n + \int_0^t\frac{dM_n}{A_n(\theta_0)},
\end{aligned}
$$

where $N_n = \sum_{i=1}^{n}N_i$ and $M_n = \sum_{i=1}^{n}M_i$. From (A.19) it is easy to see that the first term converges to zero in probability uniformly on $(0,\tau]$ as $n \to \infty$. The last two terms of the above expression are square integrable local martingales with variation processes

$$\int_0^t\frac{(A_n(\hat\theta) - A_n(\theta_0))^2}{(A_n(\hat\theta))^2A_n(\theta_0)}\,d\Lambda_0 \qquad\text{and}\qquad \int_0^t\frac{d\Lambda_0}{A_n(\theta_0)},$$

respectively. Since $A_n(\theta_0) = O_p(n)$, the above variance processes converge to zero in probability uniformly on $(0,\tau]$ as $n \to \infty$. The terms converge to zero in probability uniformly on $(0,\tau]$ by an argument similar to that of Andersen and Gill [2] via the Lenglart inequality. Therefore

$$\hat\Lambda_0(t) \to \Lambda_0(t)$$

uniformly on $(0,\tau]$. Thus, we can prove by the standard argument of kernel estimation that

$$\hat\lambda_0(t) \to \lambda_0(t)$$

uniformly on $(0,\tau]$. □
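The estimator of the cumulative baseline hazard studied in this proof has the Breslow form $\int_0^t dN_n/A_n(\hat\theta)$. The sketch below illustrates that form under simplifying assumptions: the linear predictors $\hat\beta(W_i)^TZ_i + \hat g(W_i)$ are supplied as a vector `eta`, covariates are time-independent, and the restriction to $W_i \in J_W$ is omitted; the function name is hypothetical.

```python
import numpy as np

def breslow_cumulative_hazard(t, X, delta, eta):
    """Breslow-type estimate: sum over failures X_i <= t of 1 / sum_{j at risk} exp(eta_j)."""
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta)
    risk = np.exp(np.asarray(eta, dtype=float))   # exp of fitted linear predictors
    total = 0.0
    for i in np.flatnonzero((delta == 1) & (X <= t)):
        total += 1.0 / np.sum(risk[X >= X[i]])    # A_n evaluated at the failure time
    return total
```

With `eta` identically zero this reduces to the Nelson-Aalen estimator. A kernel smooth of the increments of this step function would then give an estimate of $\lambda_0(t)$ itself, in line with the last step of the proof.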

Proof of Theorem 5. From the proof of Theorem 2, we easily show 
that this theorem holds. □ 

Proof of Theorem 6. Using the same proof as in Theorem 2, we can 
get 

£' n (Ho)=0 P ((nh)- l / 2 + h 2 ). 

Let a n = (n/i)" 1 / 2 + h? + a n . Following the same lines as the proof of The- 
orem 1 of [17], the result follows. □ 

Lemma A. 5. Suppose that the conditions of Theorem 6 hold. Then with 
probability tending to 1, for any given £ 1 satisfying — £ 10 || = O p ((n/i) -1 / 2 4 
h?) and any constant C , we have 

Q((£F,0) T )= max Q((£F,£f) T ). 

||^ 2 ||<C , [(n/i)- 1 /2 + / l 2] 

Proof. From an argument similar to that in step (b) in the proof of 
Theorem 2, it is easy to show that 

£' n (tio) = P ((nhr 1/2 + h 2 ), 

and by an argument similar to that in step (c) of the proof of Theorem 2, 
we have 
The result follows from the proof of Lemma 1 of [17]. □



Proof of Theorem 7. It follows from Lemma A. 5 and Theorem 6 
that the first result of Theorem 7 holds. Now we prove the second result of 
Theorem 7. It can be easily shown that there exists a ^ 1 as in Theorem 6 that 
is a local maximizer of Q(£i , 0) T , and that satisfies the likelihood equations 



0£i 



= 0. 



Using the Taylor expansion of (dQ($,))/d^ 1 at point £ an d noting that £ x 
is a consistent estimator from Theorem 6, we have 

d£n(to) | ( 2 Uto) o m \ (t * , 

^r +( v^r +0p(1) J ( ^"^ io) 

(A.20) 

-6-(Si + o p (l))(| 1 -€io) = 0. 
From the proof of Theorem 2, it is easy to show that 



and 



Thus, we have 



■i ^(eo )xx-i 



H " n 7™ ^ H- 1 — >-A(r, W0 ). 



(A.21) 



and 



0& 2 

JV(0,ni(r,T^)) 



(A.22) HT 1 ^^^ 1 — AxCr,^). 

By some simple calculations, we easily show that the second result of The- 
orem 7 follows from (A.20), (A.21) and (A.22). □ 

APPENDIX B 



Concavity and maxima of $Y(t,\zeta)$. Here we prove that $Y(t,\zeta)$ defined by (A.7) is concave with respect to $\zeta$. Differentiating the function $Y(t,\zeta)$ with respect to $\zeta$, we have

$$\frac{\partial Y(\tau,\zeta)}{\partial\zeta} = \int_0^\tau s_1^*(u)\lambda_0(u)\,du - \int_0^\tau\frac{s_1(u,\zeta)}{s_0(u,\zeta)}\,s_0^*(u)\lambda_0(u)\,du,$$

$$\frac{\partial^2Y(\tau,\zeta)}{\partial\zeta^2} = -\int_0^\tau\frac{s_2(u,\zeta)s_0(u,\zeta) - (s_1(u,\zeta))^{\otimes 2}}{(s_0(u,\zeta))^2}\,s_0^*(u)\lambda_0(u)\,du.$$

By the integral transform and the fact that $aa^T + bb^T \ge 2ab^T$ for any vectors $a$ and $b$, we can show that

$$\frac{\partial^2Y(t,0)}{\partial\zeta^2} < 0.$$

Again by $s_k^*(u,\theta_0) = s_k(u,0)\exp(g_0(w_0))$, $k = 0,1$, we have

$$\frac{\partial Y(t,0)}{\partial\zeta} = 0.$$

Hence $\zeta = 0$ is the maximizer of $Y(t,\zeta)$.

Proof of (A. 5). It is easy to show that 



max sup |C+(i) - C+(ij_i)| < J\ + J2, 



Ki<M 



|t-ti-i|<<5 



where 



Ji = max sup 

l^i^lt-ti-il^ 



and 



J2 = max sup 



n- 1 Y / Y 1 (t)g + (W J ,(W J -w )/h,Z j (t))K h (W 1 -w ) 

n 



n 

xK^-^o) 

n 

- 1 ^Yjiti-Jg+iW,, (Wj - w )/h, ZjiU-!)) 

i=l 



n 
Note that $Z_j(t)$ ($j = 1,2,\ldots,n$) is continuous on $[0,\tau]$. Thus we easily obtain
that

Ji < max sup sup (W} — wo)/h, Zj(t)) 

l<J<n te [ 0)T ] |*_* < _ 1 _l<a 

-» + (W J -,(W,--«;o)A,Z i (t < _i))| 

n 

x sup n" 1 ^y i (t) J PC /l (W r i -w ), 
te[o,r] i= i 

which tends to zero in probability. Since $Y_j(t)$ is a decreasing function of $t$,
we have, for any $\varepsilon > 0$,



>s) <MP(n _1 



3=1 



x (Wj - w ) /h, Z j )K h {W j - w ) 



> e 



It is easy to show that 



n- 1 Y,I(U-i<X J <t l )g + (W j ,(W J -wo)/h,Z j (t l ^ 1 ))K h (W j -wo) 

3=1 



f(wo) J E{I(t l _ 1 <X<t i )g + {wo,u,Z ] {t l _ 1 )\W = wo)}K{u)du. 
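On the other hand,

$$
\begin{aligned}
&E(I(t_{i-1}\le X < t_i)\,g^+(w_0,u,Z(t_{i-1})) \mid W = w_0) \\
&\quad\le E^{1/2}\{I(t_{i-1}\le X < t_i) \mid W = w_0\}\times E^{1/2}\{g^{+2}(w_0,u,Z(t_{i-1})) \mid W = w_0\} \\
&\quad = |P(X < t_{i-1} \mid W = w_0) - P(X < t_i \mid W = w_0)|^{1/2}\times E^{1/2}\{g^{+2}(w_0,u,Z(t_{i-1})) \mid W = w_0\} \\
&\quad < \varepsilon
\end{aligned}
$$

as $|t_i - t_{i-1}| \le \delta$. Hence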



Pi n 



J2i(u-i<x j <t i ) 

3=1 

x g+(Wj, {W 3 - w )/h, Zj{U-i))K h {Wj - w ) 



> £ 
<P[ ^2n- 1 I(t i -i<X j <t i )



3=1 



x g+(Wj, (Wj - w )/h, Zj(ti-i))K h (Wj - w ) 
- f(w ) J E(I(U^ <X< U) 



x g + (w ,u, Z(ti-i))\W = w )K(u) du > e/2 



) 




x g + (w ,u, Z(ti-i))\W = w )K(u) du\ > e/2 



< 7]. 



Hence for any $\eta > 0$ and $\varepsilon > 0$ there exists $N_0$ such that for $n > N_0$ we have



This completes the proof of (A.5). □